Perform multiple linear regression on Airbnb SF data

Get the Airbnb data from a Parquet file in the Databricks GitHub repository

https://github.com/databricks/LearningSparkV2/blob/master/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet/part-00000-tid-4320459746949313749-5c3d407c-c844-4016-97ad-2edec446aa62-6688-1-c000.snappy.parquet

Load the data and create a Spark DataFrame

Review the columns of the DataFrame

Look at the data to get an understanding of it

Understand the data types in the DataFrame

Get the list of numerical columns
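
One way to pick out the numerical columns is to filter the (name, type) pairs that a Spark DataFrame exposes through its dtypes attribute. The sketch below works on a plain list of such pairs so it runs without Spark; the set of type names treated as numeric, the helper name, and the example schema are assumptions for illustration.

```python
# Spark SQL type names treated as numeric here (an assumption);
# decimal columns show up as e.g. "decimal(10,2)".
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of numeric columns from (name, type) pairs,
    in the format produced by DataFrame.dtypes."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


# Made-up schema for illustration; in the notebook this would be the
# dtypes attribute of the loaded DataFrame.
example_dtypes = [
    ("room_type", "string"),
    ("accommodates", "double"),
    ("bedrooms", "double"),
    ("price", "double"),
]
print(numeric_columns(example_dtypes))  # ['accommodates', 'bedrooms', 'price']
```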

Create a correlation matrix plot to find correlations
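
As a reminder of what the correlation matrix contains, here is a plain-Python sketch of the Pearson correlation between every pair of columns. It is independent of Spark and of any plotting library; in the notebook the matrix would more likely be computed with Spark (for example pyspark.ml.stat.Correlation) and drawn as a heatmap. The function names and the sample values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Map each pair of column names to its Pearson correlation."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values for illustration only.
sample = {
    "accommodates": [1.0, 2.0, 4.0, 6.0],
    "bedrooms": [1.0, 1.0, 2.0, 3.0],
    "price": [80.0, 110.0, 200.0, 310.0],
}
matrix = correlation_matrix(sample)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```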

Price follows a log-normal distribution

Log transformation of price to make its distribution approximately normal
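
If price is log-normally distributed, then log(price) is normally distributed by definition, which is why the model is fitted on the log scale. The plain-Python sketch below shows the transformation and its inverse on made-up prices; in Spark the same step could be written with pyspark.sql.functions.log, and predictions would be mapped back with the exponential.

```python
import math
import statistics

# Made-up nightly prices with a long right tail, for illustration only.
prices = [85.0, 120.0, 150.0, 300.0, 1200.0]

# Natural log compresses the large values and reduces the skew.
log_prices = [math.log(p) for p in prices]

# exp() undoes the transformation, so predictions made on the log scale
# can be converted back to prices.
recovered = [math.exp(v) for v in log_prices]

# On the raw scale the mean is pulled far above the median by the tail;
# on the log scale the two are much closer together.
print(statistics.mean(prices), statistics.median(prices))
print(statistics.mean(log_prices), statistics.median(log_prices))
```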

Create train and test data for the regression model

Model 1

Preparing features for the multiple linear regression model with the numeric columns: "accommodates", "bedrooms", "bathrooms", "beds"

Residual plot

Model 2

MLR using all numeric and categorical data

Create a pipeline for multiple linear regression (MLR)

Calculate RMSE and R2
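
As a reminder of what the two metrics measure, here is a plain-Python sketch of their definitions, independent of Spark. In the notebook they would more likely come from pyspark.ml.evaluation.RegressionEvaluator with metricName set to "rmse" or "r2". The function names and the example values are made up for illustration.

```python
import math


def residuals(actual, predicted):
    """Actual minus predicted, one value per row (what a residual plot shows)."""
    return [a - p for a, p in zip(actual, predicted)]


def rmse(actual, predicted):
    """Root mean squared error: the typical size of a residual, in label units."""
    errors = residuals(actual, predicted)
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))


def r2(actual, predicted):
    """Coefficient of determination: 1 minus residual variation over total variation.

    Assumes the actual values are not all identical.
    """
    mean_actual = sum(actual) / len(actual)
    ss_res = sum(e ** 2 for e in residuals(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up labels for illustration only.
actual = [1.0, 2.0, 3.0, 4.0]
print(rmse(actual, actual), r2(actual, actual))  # perfect predictions: 0.0 1.0
print(r2(actual, [2.5, 2.5, 2.5, 2.5]))          # always predicting the mean: 0.0
```

When the label is the log of price, both metrics are on the log scale unless the predictions are exponentiated back to prices first.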

Residual plot